PACT

1 Overview of changes vs. XPP V2.0
1.1 ALU-PAE Architecture

A PAE comprises 4 input ports and 4 output ports. Embedded with each PAE is the
FREG path, newly named DF, with its dataflow capabilities such as MERGE, SWAP and
DEMUX as well as ELUT.

The input ports Ri0 and Ri1 are directly connected to the ALU. Two output ports receive
the ALU results.

Ri2 and Ri3 are typically fed to the DF path, whose outputs are Ro2 and Ro3.
Alternatively Ri2 and Ri3 can serve as inputs for the ALU as well. This extension is
needed to provide a suitable number of ALU inputs if Function Folding (as described
later) is used. In this mode Ro2 and Ro3 serve as additional outputs.

Associated with each data register (Ri or Ro) is an event port (Ei or Eo).

It is to be decided whether an additional data and event bypass
BRi0-1, BEi0-1 is implemented. The decision depends on how
often Function Folding will be used and how many inputs and
outputs are required on average.
1.1.1 Other extensions

SIMD operation is implemented in the ALUs to support 8 and 16 bit wide data words,
e.g. for graphics and imaging.

Saturation is supported for ADD/SUB/MUL instructions, e.g. for voice, video and
imaging algorithms.

1.2 Function Folding 

1.2.1 Basics and input/output paradigms

Within this chapter the basic operation paradigms of the XPP architecture are
repeated for a better understanding based on Petri-Nets. In addition the Petri-Nets
will be enhanced for a better understanding of the subsequently described changes
of the current XPP architecture.

Each PAE operates as a data flow node as defined by Petri-Nets. A Petri-Net
supports a calculation of multiple inputs and produces one single output. A special
property of a Petri-Net is that the operation is delayed until all inputs are available.

For the XPP technology this means: 

1. all necessary data is available

2. all necessary events are available 

The quantity of data and events is defined by the data and control flow, the 
availability is displayed at runtime by the handshake protocol RDY/ACK. 



The thick arbor indicates the operation, the dot on the right side indicates that the 
operation is delayed until all inputs are available. 

Enhancing the basic methodology, Function Folding supports multiple operations,
possibly even sequential ones, instead of one, defined as a Cycle. It is important that
the basics of Petri-Nets remain unchanged.

Typical PAE-like Petri-Nets consume one input packet per operation. For
sequential operation multiple reads of the same input packet are supported. However,
the interface model again remains unchanged.

Data duplication occurs in the output path of the Petri-Net, which again does not
influence the operation basics.
1.2.2 Method of Function Folding

One of the most important extensions is the capability to fold multiple PAE functions
onto one PAE and execute them in a sequential manner. It is important to understand
that the intention is not to support sequential processing or even microcontroller
capabilities at all. The intention of Function Folding is just to take multiple dataflow
operations and map them on a single PAE, using a register structure instead of a
network between each function.

The goal is to save silicon area by raising the clock frequency locally in the PAEs. An
additional expectation is to save power, since the busses operate at a fraction of the
clock frequency of the PAEs. Data transfers over the busses, which consume much
power, are reduced.
The internal registers can be implemented in two different ways:

1. dataflow model

Each register (r) has a valid bit which is set as soon as data has been written into the
register and reset after the data has been read. Data cannot be written if valid is set;
data cannot be read if valid is not set. This approach implements a 100% compatible
dataflow behaviour.

2. sequencer model 

The registers have no associated valid bits. The PAE operates as a sequencer,
whereas at the edges of the PAE (the bus connects) the paradigm is changed to the
XPP-like dataflow behaviour.

Even if at first the dataflow model seems preferable, it has major downsides. One is
that a large number of registers is needed to implement each data path, and data
duplication is quite complicated and not efficient. Another is that sometimes a limited
sequential operation simplifies programming and hardware effort.
Therefore it is assumed in the following that the sequencer model is implemented. Since
pure dataflow can be folded using automatic tools, the programmer should stay within
the dataflow paradigm and not be confused with the additional capabilities. Automatic
tools must take care, e.g. during register allocation, that the paradigm is not violated.

The following figure shows that using sequencer model only 2 registers (instead of 4) 
are required: 
To allow complex functions such as address generation, as well as algorithms like
"IMEC"-like data stream operations [text illegible in source], the maximum bus-clock
vs. PAE-clock ratio is limited to a factor of 4 for usual function folding.

It is expected that the size of the new PAE supporting Function Folding will increase
by max. 25%. On the other hand 4 PAEs are reduced to 1.

Assuming that on average not the optimum but only about 3 functions can be folded
onto a single PAE, a XPP64 could be replaced by a XPP21. Taking the larger PAEs
into account, the functionality of a XPP64 V2.0 should be executable on a XPP V2.2
with an area of less than half.
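
The area estimate above can be retraced with a few lines of arithmetic. The sketch below is only an illustration: the function name and the use of "one V2.0 PAE" as the area unit are assumptions, while the figures (about 3 folded functions per PAE, PAEs larger by at most 25%) are taken from the text. Bus area is ignored.

```python
def folded_area_ratio(num_paes=64, functions_per_pae=3, size_increase=0.25):
    """Estimate the relative PAE area of a folded array (hypothetical helper).

    Area is counted in units of one V2.0 PAE.
    """
    # Number of PAEs needed when several functions share one PAE.
    folded_paes = num_paes // functions_per_pae
    # Each folding-capable PAE is assumed to be larger by size_increase.
    folded_area = folded_paes * (1.0 + size_increase)
    return folded_paes, folded_area / num_paes


paes, ratio = folded_area_ratio()
print(paes, round(ratio, 2))
```

With the default values the sketch yields 21 PAEs and a ratio below one half, which matches the "XPP21" and "less than half" statements.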



1.3 Array Structure 

The V2.0 structure of the PAEs consumes much area for FREG and BREG and their
associated bus interfaces. In addition feedbacks through the FREGs require the
insertion of registers into the feedback path, which results not only in an increased
latency but also in a negative impact on the throughput and performance of the
XPP.

A new PAE structure and arrangement is proposed with the expectation to minimize
latency and optimize the bus interconnect structure to achieve an optimized area.
The V2.2 PAE structure does not include BREGs any more. As a replacement the
ALUs are alternately flipped horizontally, which leads to improved placement and
routing capabilities, especially for feedback paths, e.g. of loops.
Each PAE now contains two ALUs and two BP paths, one from top to bottom and
one flipped from bottom to top.
1.4 Bus modifications

Within this chapter optimizations are described which reduce the
required area and the number of busses. However, these
modifications are only proposals yet, since they have to be
evaluated based on real algorithms. It is planned to compose a
questionnaire to collect the necessary input from the application
programmers.

1.4.1 Next neighbour

In the V2.0 architecture a direct horizontal data path between two PAEs blocks a vertical
data bus. This effect increases the required vertical busses within a XPP and drives
cost unnecessarily.

Therefore in V2.2 a direct feed path between horizontal PAEs is proposed.



1.4.2 Removal of registers in busses

In V2.0 registers are implemented in the vertical busses which can be switched on by
configuration for longer paths. These registers can furthermore be preloaded by
configuration, which requires a significant amount of silicon area.

It is proposed not to implement registers in the busses any more, but to use an
enhanced DF or Bypass (BP) part within the PAEs which is able to reroute a path to
the same bus using the DF or BP internal registers instead.

It is to be evaluated

a) how many resources are saved for the busses and how
many are needed for the PAEs

b) how often registers must be inserted, and whether 1 or max. 2
paths per PAE are enough (the limit is two since DF/BP offers
max. 2 inputs)



1.4.3 Shifting n:1, 1:n capabilities from busses to PAEs

In V2.0 n:1 and 1:n transitions are supported by the busses, which requires a
significant amount of resources, e.g. for the sample-and-hold stage of the handshake
signals.

Depending on the size of n different capabilities are provided with the new PAE
structure:

n ≤ 2       The required operations are done within the DF path of the PAE
2 < n ≤ 4   The ALU path is required since 4 ports are necessary
n > 4       Multiple ALUs have to be combined

This method saves a significant amount of static resources in silicon but requires
dedicated PAE resources at runtime.
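
The case distinction above can be written down as a small lookup. This is a minimal sketch that assumes the thresholds are read as n ≤ 2, 2 < n ≤ 4 and n > 4; the function name is hypothetical and only the three resource names come from the text.

```python
def resources_for_transition(n):
    """Return the PAE resource the text names for an n:1 or 1:n transition."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if n <= 2:
        return "DF path"        # handled inside the DF path of one PAE
    if n <= 4:
        return "ALU path"       # needs all 4 ports of the PAE
    return "multiple ALUs"      # several ALUs have to be combined


for n in (1, 2, 3, 4, 5):
    print(n, resources_for_transition(n))
```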



It is therefore to be evaluated

c) how much silicon area is saved per bus

d) how often the cases n ≤ 2, 2 < n ≤ 4 and n > 4 occur

e) the ratio between saved silicon area and required PAE
resources



1.5 FSMs in RAM-PAEs

In the V2.0 architecture implementing control structures is very costly: a lot of
resources are required and programming is quite difficult.

However, memories can be used for a simple FSM implementation. The following
enhancement of the RAM-PAEs offers a cheap and easy to program solution for
many of the known control issues, including HDTV.
Basically the RAM-PAE is enhanced by a feedback from the data output to the
address input through a register (FF) to supply the subsequent address within each
stage. Furthermore, additional address inputs from the PAE array can cause
conditional jumps; the data output will generate event signals for the PAE array.
Associated counters, which can be reloaded and stepped by the memory output,
generate address input for conditional jumps (e.g. end of line, end of frame of a video).

A typical RAM-PAE implementation has about 16-32 data bits but only 8-12 address
bits. To optimize the range of input vectors it is therefore suggested to insert some
multiplexers at the address inputs to select between multiple vectors, whereas the
multiplexers are controlled by some of the output data bits.

The implementation for a XPP having 24 bit wide data busses is sketched in the next
figure. 4 event inputs are used as input, as well as the lower four bits of input port Ri0.
3 counters are implemented; 4 events are generated as well as the lower 10 bits of
the Ro0 port.



The memory organisation is as follows:
8 address bits
24 data bits (22 used)
    4 next address
    8 multiplexer selectors
    6 counter control (shared with 4 additional next address)
    4 output
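
The feedback principle (memory output supplies the next address, event inputs steer conditional jumps) can be illustrated with a strongly simplified model. Everything in this sketch beyond that principle is an assumption made for illustration: the 8 address bits are formed from a 4 bit next-address field (upper half) and 4 event inputs (lower half), each word holds only a next-address field and 4 output bits, and the multiplexers and counters are left out. All names are hypothetical.

```python
def make_word(next_addr, output):
    """Pack a simplified memory word: 4 next-address bits and 4 output bits."""
    return ((next_addr & 0xF) << 4) | (output & 0xF)


def run_fsm(memory, event_inputs, start_state=0):
    """Run the memory-based FSM for a list of 4 bit event input vectors."""
    state = start_state                 # plays the role of the register (FF)
    outputs = []
    for events in event_inputs:
        address = (state << 4) | (events & 0xF)   # 8 address bits
        word = memory[address]
        state = (word >> 4) & 0xF       # next address, fed back through FF
        outputs.append(word & 0xF)      # output bits, e.g. event outputs
    return outputs


# Example program: a two-state FSM that toggles whenever event bit 0 is set
# and drives its new state onto output bit 0.
memory = [0] * 256
for events in range(16):
    for state in (0, 1):
        nxt = state ^ (events & 1)
        memory[(state << 4) | events] = make_word(nxt, nxt)

print(run_fsm(memory, [1, 0, 1, 1]))
```

Programming the FSM then amounts to filling the memory, which is the reason the text calls the approach cheap and easy to program.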



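[Figure: block diagram of the RAM-PAE in FSM mode. Recoverable labels: event inputs Ei0-Ei3, data inputs Ri0-Ri3, next address (shared), event outputs Eo0-Eo3, data outputs Ro1-Ro3, counters CNT2 and CNT3.]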



Please note that the typical memory mode of the RAM-PAE is not sketched in the
block diagram above.

The width of the counters is according to the bus width of the data busses.

For a 16 bit implementation it is suggested to use the carry signal of the counters as
their own reload signal (auto reload); also some of the multiplexers are not driven by
the memory but "hard wired" by the configuration.



The proposed memory organisation is as follows: 
8 address bits 
16 data bits (16 used) 

4 next address 

4 multiplexer selectors 

3 counter control (shared with 3 additional next address) 

4 output 
Actually the RAM-PAEs are not scalable any more since the
16-bit implementation is different from the 24-bit implementation.
It is to be decided whether the stripped-down 16-bit implementation is
used for 24 bit also.



1.6 IOAG Interface

1.6.1 Address Generators and bit reversal addressing

Implemented within the IO interfaces are address generators to support 1 to 3
dimensional addressing directly without any ALU-PAE resources. The address
generation is done by 3 counters, each of which has a configurable base address,
length and step width.

The first counter (CNT1) has a step input to be controlled by the array of ALU-PAEs.
Its carry is connected to the step input of CNT2, whose carry again is connected to
the step input of CNT3.

Each counter generates a carry if the value is equal to the configured length.
Immediately with the carry the counter is reset to its configured base address.

One input is dedicated for addresses from the array of ALU-PAEs, which can be
added to the values of the counters. If one or more counters are not used they are
configured to be zero.
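
The counter chain can be modelled in software as follows. This is a minimal sketch, not the hardware implementation: it assumes that a counter compares its value with the configured length right after each step, and that the generated address is the plain sum of the three counter values and the immediate input. Class and function names are hypothetical.

```python
class Counter:
    """One IOAG counter with configurable base address, step width and length."""

    def __init__(self, base=0, step=0, length=0):
        self.base, self.step, self.length = base, step, length
        self.value = base

    def do_step(self):
        """Advance by one step; return True if a carry is generated."""
        self.value += self.step
        if self.value == self.length:   # carry condition from the text
            self.value = self.base      # reset immediately with the carry
            return True
        return False


def generate_addresses(counters, count, immediate=0):
    """Produce `count` addresses; CNT1 is stepped once per address."""
    addresses = []
    for _ in range(count):
        addresses.append(sum(c.value for c in counters) + immediate)
        carry = True                    # step input of CNT1
        for counter in counters:        # carries ripple CNT1 -> CNT2 -> CNT3
            if not carry:
                break
            carry = counter.do_step()
    return addresses


# 2D example: 3 rows of 4 words; the unused CNT3 is configured to zero.
cnt1 = Counter(base=0, step=1, length=4)
cnt2 = Counter(base=0, step=4, length=12)
cnt3 = Counter()
print(generate_addresses([cnt1, cnt2, cnt3], 12))
```

In this example the generator walks linearly through a 3 x 4 block; a different step width on CNT2 would model a larger row pitch.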
In addition CNT1 supports generation of bit reversal addressing by supplying multiple
carry modes.
1.6.2 Support for different word width

In general it is necessary to support multiple word widths within the PAE array. 8 and
16 bit wide data words are preferred for a lot of algorithms, e.g. graphics. In addition to
the already described SIMD operation, the IOAG allows the split and merge of such
smaller data words.

Since the new PAE structure allows 4 input and 4 output ports, the IOAG can support
word splitting and merging as follows:

Input ports are merged within the IOAG for word writes to the IO.

For output ports the read word is split according to the configured word width.
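
Merging and splitting can be pictured as plain bit-field packing. The sketch below is an assumption-laden illustration: the text does not say which port ends up in which bit positions, so port 0 is placed in the least significant bits here, and both function names are hypothetical.

```python
def merge_words(words, width):
    """Merge narrow words from several input ports into one wide word.

    words[0] is placed in the least significant bits (assumed ordering).
    """
    mask = (1 << width) - 1
    merged = 0
    for index, word in enumerate(words):
        merged |= (word & mask) << (index * width)
    return merged


def split_word(word, width, count):
    """Split one wide word into `count` narrow words for the output ports."""
    mask = (1 << width) - 1
    return [(word >> (index * width)) & mask for index in range(count)]


wide = merge_words([0x34, 0x12], width=8)
print(hex(wide), [hex(w) for w in split_word(wide, width=8, count=2)])
```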



1.7 XPP/µP coupling

For a close coupling of a µP and a XPP a cache and register interface would be the
preferable structure for high level tools like C compilers. However, such a close
coupling is expected not to be doable in a very first step.
Two different kinds of couplings are necessary for a tight coupling:

a) memory coupling for large data streams: The most convenient method with
the highest performance is a direct cache coupling, whereas an AMBA based
memory coupling will be sufficient for the beginning (to be discussed with
ATAIR)

b) register coupling for small data and irregular MAC operations: Preferable is a
direct coupling into the processor's registers with an implicit synchronisation in
the OF-stage of the processor pipeline. However, coupling via load/store or
in/out commands as external registers is acceptable with the penalty of a
higher latency, which causes some performance limitation (already agreed with
ATAIR)
2 Specification of ALU-PAE

2.1 Overview

The ALU-PAE comprises 3 paths:
ALU   arithmetic, logic and data flow handling
DF    data flow handling and bypass
BP    bypass

Each of the paths contains 2 data busses and 1 event bus. The busses of the DF
path can be rerouted to the ALU path by configuration.

2.2 ALU path Registers 

The ALU path comprises 12 data registers:

Ri0-3    Input data register 0-3 from bus
Rv0-3    Virtual output data register 0-3 to bus
Rd0-3    Internal general purpose register 0-3

Ei0-3    Event input 0-3 from bus
Ev0-3    Virtual event output register 0-3 to bus
Fu, Fv   Flag u and v according to the V2.0 PAE

Note: Ri2 and Ri3 belong typically to the DF path, but can be allocated for the ALU
by configuration.

Eight instruction registers are implemented, each of them is 16 bit wide according to
the opcode format.

Rc0-7    Instruction registers

Three special purpose registers are implemented:
Rlc      Loop Counter, configured by CM, not accessible through the ALU-PAE
         itself. Will be decremented according to the JL opcode. Is reloaded after
         value 0 is reached.
Rjb      Jump-Back register to define the number of used entries in Rc[0..7]. It
         is not accessible through the ALU-PAE itself.
         If Rpp is equal to Rjb, Rpp is immediately reset to 0. The jump back
         can be bound to a condition, e.g. an incoming event. If the condition is
         missing, the jump back will be delayed.
Rpp      Program pointer
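
The interplay of Rpp and Rjb can be sketched as a next-state function. This is only an illustration under stated assumptions: jumps (JR, JL) and the loop counter are left out, a delayed jump back is modelled by Rpp simply holding its value, and the helper names are hypothetical.

```python
def next_rpp(rpp, rjb, condition_met=True):
    """Return the next program pointer value for one Cycle step."""
    if rpp == rjb:
        # Jump back to entry 0; delayed while a bound condition is missing.
        return 0 if condition_met else rpp
    return rpp + 1


def trace(rjb, steps):
    """List the program pointer values visited over `steps` steps."""
    rpp, seen = 0, []
    for _ in range(steps):
        seen.append(rpp)
        rpp = next_rpp(rpp, rjb)
    return seen


print(trace(rjb=2, steps=7))
```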

2.3 Data duplication and multiple input reads

Since Function Folding can operate in a purely data stream mode as well as in a
sequential mode (see 1.2), it is useful to support Ri reads in dataflow mode (single
read only) and sequential mode (multiple read). The according protocols are
described below:

Each input register Ri can be configured to work in one of two different modes:
Dataflow Mode

This is the standard protocol of the V2.0 implementation:

A data packet is taken from the bus if the register is empty, and an ACK handshake
is generated. If the register is not empty, the data is not latched and ACK is not
generated.

If the register contains data, it can be read once. Immediately with the read access
the register is marked as empty. An empty register cannot be read.

Simplified, the protocol is defined as follows:
RDY & empty    → full
               → ACK
RDY & full     → notACK
READ & empty   → stall
READ & full    → read data
               → empty

Please note: pipeline effects are not taken into account in this description and
protocol.



Sequencer Mode

The input interface is according to the bus protocol definition: A data packet is taken
from the bus if the register is empty, and an ACK handshake is generated. If the
register is not empty, the data is not latched and ACK is not generated.
If the register contains data it can be read multiple times during a sequence. A
sequence is defined from Rpp = 0 to Rpp = Rjb. During this time no new data can be
written into the register. Simultaneously with the reset of Rpp to 0 the register
content is cleared and new data is accepted from the bus.

Simplified, the protocol is defined as follows:
RDY & empty    → full
               → ACK
RDY & full     → notACK
READ & empty   → stall
READ & full    → read data
(Rpp == Rjb)   → empty

Please note: pipeline effects are not taken into account in this description and
protocol.
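
Both input register modes can be compared side by side in a small behavioural model. The sketch follows the simplified protocols given in this chapter and ignores pipeline effects, as the text does. The class, its method names and the use of `None` to signal a stall are assumptions for illustration.

```python
class InputRegister:
    """Behavioural sketch of an input register Ri in dataflow or sequencer mode."""

    def __init__(self, sequencer_mode=False):
        self.sequencer_mode = sequencer_mode
        self.data = None
        self.full = False

    def rdy(self, packet):
        """The bus offers a packet (RDY). Returns True if ACK is generated."""
        if self.full:
            return False                 # RDY & full -> notACK
        self.data, self.full = packet, True
        return True                      # RDY & empty -> full, ACK

    def read(self):
        """Read the register; None stands for a stall (READ & empty)."""
        if not self.full:
            return None
        if not self.sequencer_mode:
            self.full = False            # dataflow mode: single read only
        return self.data

    def end_of_sequence(self):
        """Rpp == Rjb: in sequencer mode the register becomes empty again."""
        if self.sequencer_mode:
            self.full = False


ri = InputRegister(sequencer_mode=True)
ri.rdy(7)
print(ri.read(), ri.read(), ri.rdy(8))   # multiple reads, new data refused
ri.end_of_sequence()
print(ri.rdy(8), ri.read())
```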
2.4 Data register and event handling

Data registers are directly addressed, each data register can be individually selected.
Since a two address opcode form is used, register operations follow the rule
op ra ← ra, rb. A virtual output register is selected by adding 'out' behind the opcode. The
result will be stored in ra and copied to the virtual output register rv as well, according
to the rule op out (rv, ra) ← ra, rb.

Please note, accessing input and (virtual) output registers follows the rules defined in
chapter 2.3.

Rotating Select

Under normal conditions data and events are read one time according to the
principles of Petri-Nets. Therefore for most applications a one time access per Cycle
is sufficient. Also per definition one data or event is generated by a Petri-Net per
channel and Cycle.

If Function Folding is done in a sequential manner, synchronisation is achieved by
using WAIT and SKIP commands. If multiple accesses to an event are required it can
be copied by the READE instruction to the u or v flags, which can be used
successively for multiple commands.

The Rotating Select starts on the first access to events with the event E0, steps with
the second access over E1 and E2 to E3 (at the fourth access) and restarts with the
fifth access at E0 again.



Reset or Rpp == Rjb      E0
after 1st event access   E1
after 2nd event access   E2
after 3rd event access   E3
after 4th event access   continue with E0
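
The rotation can be modelled as a modulo-4 counter that is advanced by every event access and reset together with Rpp. A minimal sketch with hypothetical class and method names:

```python
class RotatingSelect:
    """Rotation counter selecting E0..E3 on successive event accesses."""

    def __init__(self, num_events=4):
        self.num_events = num_events
        self.index = 0

    def access(self):
        """Return the event selected by this access and advance the rotation."""
        selected = self.index
        self.index = (self.index + 1) % self.num_events
        return f"E{selected}"

    def reset(self):
        """Reset, or Rpp == Rjb: the next access starts at E0 again."""
        self.index = 0


select = RotatingSelect()
print([select.access() for _ in range(5)])
```

Since the text provides an explicit rotation counter for each of reading and writing, two independent instances would be used in a fuller model.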



Rotating select is supported for reading events and writing events with an explicit
rotation counter for each read and write. Writing to events copies the value to the u
flag at the same time; et(v) and ee(v) cause copying to the v flag.

For each opcode E0 and the internal flags u and v can be selected explicitly by the
following selection modes. E0 can therefore easily be used for multiple write event
accesses per Cycle, since there is no need to use the rotating select mode for E0 for
most of the opcodes:



Selection modes for et (event target) / es (event source) and
eventt (event target) / events (event source):

Code   Selection
00     internal u
01     internal v
10     external Ev0
11     rotating select: external next (Ev0/Ev1/Ev2/Ev3) and internal u flag

Event Enable enables or disables writing a flag to a virtual event output. However,
the flag will be set in the internal u or v register anyhow.

ee (event enable)

Code   Selection
0      internal v or u
1      internal v or u and rotating select: external next (Ev0/Ev1/Ev2/Ev3)



Instructions offering only ALU internal flags as source for the operations:

• SAT

The event addressing supports the selection between the u and v flag.

Instructions allowing directly addressed event sources using eventt and events:

• WAIT, SKIP, READE, WRITEE
• MERGE, DEMUX, SWAP

Instructions offering limited addressed event sources and rotating event select (et, es):

• SHL, SHR, DSHL, DSHR, DSHRU
• ADD, ADDC, SUB, SUBC

Event targets

Some instructions operate using rotating event select only (et, es):

• NOT, SORT, SORTU, CLZ, CLZU, AND, OR, XOR, EQ, CMP, CMPU

Some instructions support Event Enable only (ee):

• SHL, SHR, DSHL, DSHR, DSHRU
• ADD, ADDC, SUB, SUBC



2.4.1 n:1 Transitions

1:n transitions are not supported within the busses any more. Alternatively simple
writes to multiple output registers Ro and event outputs Eo are supported. The Virtual
Output registers (Rv) and Virtual Events (Ev) are translated to real Output registers
(Ro) and real Events (Eo), whereas a virtual register can be mapped to multiple
output registers.

To achieve this a configurable translation table is implemented for both data registers
and event registers:
Rv/Ev   Ro0/Eo0   Ro1/Eo1   Ro2/Eo2   Ro3/Eo3
0
1
2
3

Example:

Rv0 mapped to Ro0, Ro1
Rv1 mapped to Ro2
Rv2 mapped to Ro3
Rv3 unused

Rv   Ro0   Ro1   Ro2   Ro3
0    1     1     0     0
1    0     0     1     0
2    0     0     0     1
3    0     0     0     0
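
The example table can be expressed as a lookup from a virtual register to the real output registers it drives. The values are those of the example above; the data structure and the function name are hypothetical.

```python
# One row per virtual register Rv0..Rv3, one column per real register Ro0..Ro3.
TRANSLATION_TABLE = {
    0: (1, 1, 0, 0),   # Rv0 -> Ro0, Ro1
    1: (0, 0, 1, 0),   # Rv1 -> Ro2
    2: (0, 0, 0, 1),   # Rv2 -> Ro3
    3: (0, 0, 0, 0),   # Rv3 unused
}


def real_outputs(rv, table=TRANSLATION_TABLE):
    """Return the names of the real output registers a virtual register maps to."""
    return [f"Ro{index}" for index, bit in enumerate(table[rv]) if bit]


for rv in range(4):
    print(f"Rv{rv} ->", real_outputs(rv))
```

A write to Rv0 is thereby duplicated onto Ro0 and Ro1, which is how a 1:n transition is realised without bus support.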



2.4.2 Accessing input and output registers (Ri/Rv) and events (Ei/Ev)

Independently from the opcode, accessing input or output registers or events is
defined as follows:

Reading an input register:

Register status   Operation
empty             wait for data
full              read data and continue operation

Writing to an output register:

Register status   Operation
empty             write data to register
full              wait until register is cleared and can accept new data



2.5 Opcode format

To achieve a small opcode size a two address code is used. The basic operation is:

op ra ← ra, rb

Source registers can be Ri and Rd, target registers are Rv and Rd. A typical
operation targets only Rd registers. If the source register for ra is Ri[x] the target
register will be Rd[x].

The translation is shown in the following table:
Target   Source ra
Rd0      Rd0
Rd1      Rd1
Rd2      Rd2
Rd3      Rd3
Rd0      Ri0
Rd1      Ri1
Rd2      Ri2
Rd3      Ri3


Each operation can target a Virtual Output Register Rv by adding an out tag as a
target identifier to the opcode:

op out ra ← ra, rb

The transfer is now Ri[x] or Rd[x] to Rv[x] as shown in the table below:



Target   Source ra
Rv0      Rd0
Rv1      Rd1
Rv2      Rd2
Rv3      Rd3
Rv0      Ri0
Rv1      Ri1
Rv2      Ri2
Rv3      Ri3


The opcode format is 16 bit wide, the standard formats are: 



2.6 Clock 

The PAE can operate at a configurable clock frequency of 
1x Bus Clock 
2x Bus Clock 
4x Bus Clock 
[8x Bus Clock] 



2.7 The DF path 

The DataFlow path comprises the data registers Ri2&3 and Ro2&3 as well as the
events Ei2&3 and Eo2&3. Each of the data registers Ri[n] is combined with an event
E[n], whereas the according busses support different routings.

By configuration each data path and its associated event can be dedicated to the
ALU path.

The DF path supports numerous instructions, whereas the instruction is selected by
configuration and only one of them can be performed during a configuration; function
folding is not available.

The following instructions are implemented:

1. ADD, SUB

2. NOT, AND, OR, XOR

3. SHL, SHR, DSHL, DSHR, DSHRU

4. EQ, CMP, CMPU

5. MERGE, DEMUX, SWAP

6. SORT, SORTU

7. ELUT



2.8 The BP path

The ByPass path is a simple horizontal network between the input data registers
BRi0&1 and events BEi0&1 to the output registers BRo0&1 and events BEo0&1.

3 Input Output Address Generators (IOAG)

The IOAGs are located in the RAM-PAEs and share the same registers to the busses.
An IOAG comprises 3 counters with forwarded carries. The values of the counters
and an immediate address input from the array are added to generate the address.
One counter offers reverse carry capabilities.



3.1 Addressing modes

Several addressing modes are supported by the IOAG to support typical DSP-like
addressing:

3.1.1 Immediate Addressing

The address is generated in the array and directly fed through the adder to the
address output. All counters are disabled and set to 0.

3.1.2 xD counting

Counters are enabled depending on the required dimension (x dimensions require x
counters). For each counter a base address and the step width as well as the
maximum address are configured. Each carry is forwarded to the next higher and
enabled counter; after a carry the counter is reloaded with the start address.
A carry at the highest enabled counter generates an event; counting stops.

3.1.3 xD circular

The operation is exactly the same as for xD counting, with the difference that when a
carry at the highest enabled counter generates an event, all counters are reloaded to
their base address and continue counting.



Mode                Description
Immediate           Address generated by the PAE array
xD counting         Multidimensional addressing using IOAG internal counters;
                    xD means 1D, 2D, 3D
xD circular         Multidimensional addressing using IOAG internal counters;
                    after overflow counters reload with base address
xD plus immediate   xD plus a value from the PAE array
Stack               decrement after "push" operations,
                    increment after "read" operations
Reverse carry       Reverse carry for applications such as FFT
3.1.4 Stack

One counter (CNT1) is used to decrement after data writes and increment after data
reads. The base value of the counter can either be configured (base address), or
loaded by the PAE array.

3.1.5 Reverse carry

Typically carry is forwarded from LSB to MSB. Forwarding the carry in the opposite
direction (reverse carry) allows generating address patterns which are very well
suited for applications like FFT and the like. The carry is discarded at MSB.

For using reverse carry a value larger than LSB must be added to the actual value to
count, wherefore the STEP register is used.



Example:
BASE = 0h
STEP = 1000b

Step   Counter value
1      b0...00000
2      b0...01000
3      b0...00100
4      b0...01100
5      b0...00010
...
16     b0...01111
17     b0...00000

The counter is implemented to allow reverse carry at least for STEP values of -2, -1,
+1, +2.
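
The example sequence can be reproduced in software by mirroring the operands, adding with a normal carry, and mirroring the result back, which is equivalent to letting the carry run from the upper bits towards the lower bits. This is a sketch of the arithmetic, not of the hardware; the function names are hypothetical, and the width of 4 bits is an assumption chosen to match the wrap-around at step 17.

```python
def bit_reverse(value, width):
    """Mirror the lowest `width` bits of value."""
    return int(format(value, f"0{width}b")[::-1], 2)


def reverse_carry_add(value, step, width):
    """Add step to value with the carry propagating in the reverse direction."""
    mask = (1 << width) - 1
    mirrored_sum = bit_reverse(value & mask, width) + bit_reverse(step & mask, width)
    return bit_reverse(mirrored_sum & mask, width)   # overflowing carry is discarded


def reverse_carry_sequence(base, step, width, count):
    """Return `count` counter values starting at base."""
    values, current = [], base
    for _ in range(count):
        values.append(current)
        current = reverse_carry_add(current, step, width)
    return values


for value in reverse_carry_sequence(base=0, step=0b1000, width=4, count=17):
    print(format(value, "04b"))
```

The values printed are the bit-reversed counting order used by FFT addressing, with the counter returning to the base address after 16 steps.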
Appendix A
Opcodes

Notation:

name       explanation                            bits   register / meaning                 code
target_r   Target register and related            2      Rd0                                00
           source register                               Rd1                                01
                                                         Rd2                                10
                                                         Rd3                                11
target_o   Target output register                 2      Ro0                                00
                                                         Ro1                                01
                                                         Ro2                                10
                                                         Ro3                                11
target     Target register and related            3      Ri0                                000
           source register. Target will be               Ri1                                001
           Rd or Ro (if target identifier                Ri2                                010
           is set)                                       Ri3                                011
                                                         Rd0                                100
                                                         Rd1                                101
                                                         Rd2                                110
                                                         Rd3                                111
target_p   Target register pair and related       2      Ro0&1                              00
           source register pair                          Ro2&3                              01
                                                         Rd0&1                              10
                                                         Rd2&3                              11
source_i   Source input register                  2      Ri0                                00
                                                         Ri1                                01
                                                         Ri2                                10
                                                         Ri3                                11
source     Source register                        3      Ri0                                000
                                                         Ri1                                001
                                                         Ri2                                010
                                                         Ri3                                011
                                                         Rd0                                100
                                                         Rd1                                101
                                                         Rd2                                110
                                                         Rd3                                111
source_p   Source register pair                   2      Ri0&1                              00
                                                         Ri2&3                              01
                                                         Rd0&1                              10
                                                         Rd2&3                              11
r_pair_t   Source register and target             2      target Rd0&1, source Ri0           00
           register pair                                 target Rd2&3, source Ri2           01
                                                         target Rd0&1, source Rd0           10
                                                         target Rd2&3, source Rd2           11
tid        Target identifier                      1      internal register
                                                         internal & external register
val        Value                                         one bit value
valx       Value including don't care
val2       2 bit value                            2
u/v        Select flag register Fu or Fv          1      Fu
                                                         Fv
et         event target                           2      internal u                         00
                                                         internal v                         01
                                                         external E0                        10
                                                         external next (E1/E2/E3)           11
es         event source                                  internal u                         00
                                                         internal v                         01
                                                         external E0                        10
                                                         external next (E1/E2/E3)           11
event      event target (or source)               3      internal u
                                                         internal v
                                                         E0
                                                         E1
                                                         E2
                                                         E3
                                                         external next (E1/E2/E3)
ee         event enable                                  internal
                                                         internal & external next (E1/E2/E3)
Opcode table (columns in the source: bits 0-15, IF, OF, Comment; fields after the
opcode are listed in source order)

Mnemonic   Opcode   Further fields                         Flag   Comment
NOP        000000   0 0 0 0 0 0 (remaining bits illegible)        No Operation
READ       000000   0, target_r, 0, source_i, 0, 1, 0             Read packet from input port
WRITE      000000   1, target_o, 0, source, 1, 0
MOVE       000000   0, target_r, 1, source_r, 0, 1, 0             Move data between register
LOAD       000000   1, target_r, 1, 0, 0, 0, 1, 0,                Load register with constant
                    constant
SAT        000000   0, target_r, 0, 0, 0, 1, 0, 0          u      Saturate if carry;
                                                                  '0 if previous command was SUBC,
                                                                  '1 if previous command was ADDC
SETUV      000000   0, val, 0, 0, 1, 1, 0, 0               u/v    Set flags u and v
SWAPUV     000000   0, 0, 0, 1, 0, 1, 1, 0, 0              u/v    Swap u and v flag
NOT        000000   tid, target, 1, et, 0, 0               u
JR         000000   adr7, 0, 1                                    Jump relative
JL         000000   adr7, 1, 1                                    Jump relative if Rlc is not zero
MERGE      000001   tid, target_p, source_p, event, 0      u
DEMUX      000001   tid, target_p, source_p, event, 1      u
SWAP       000010   tid, target_p, source_p, event, 0      u
WAIT       000010   0, valx, 0, 0, event, 1                u      Wait for incoming event
SKIP       000010   0, valx, 0, 1, event, 1                u      Wait for incoming event
EOPTR      000010   0, val2, 1, 0, 0, 0, 0, 1                     Set event output pointer
EIPTR                                                             Set event input pointer
Mnemonic   0-5      6     7-8        9-11     12-13   14      15      IF   OF    Comment
SHL        100000   tid   r_pair_t   source   es      ee(u)   ee(v)   u    u/v
SHR        100001   tid   r_pair_t   source   es      ee(u)   ee(v)   u    u/v
DSHL       100010   tid   r_pair_t   source   es      ee(u)   ee(v)   u    u/v
DSHR       100011   tid   r_pair_t   source   es      ee(u)   ee(v)   u    u/v
DSHRU      101000   tid   r_pair_t   source   es      ee(u)   ee(v)   u    u/v
ASI        101001                                                                Application specific instructions
           101010
           101011
           101100
           101101
           101110

Mnemonic   0-4     5     6-8      9-11     12-13   14      15      IF   OF    Comment
ADD        11000   tid   target   source   es      ee(u)   ee(v)   u    u/v
ADDC       11001   tid   target   source   es      ee(u)   ee(v)   u    u/v
SUB        11010   tid   target   source   es      ee(u)   ee(v)   u    u/v
SUBC       11011   tid   target   source   es      ee(u)   ee(v)   u    u/v
Mnemonic   Opcode   Further fields                               Flag   Comment
           000010   0, val2, 1, 0, 0, 0, 1, 1
READE      000010   0, 0, u/v, 1, 1, event, 1                    u/v    Read Event to
WRITEE     000010   0, 1, u/v, 1, 1, event, 1                    u/v    Write Event from U/V
SORT       000011   tid, target_p, source_p, et(u), et(v)        u/v    Sort two data packets
SORTU      000100   tid, target_p, source_p, et(u), et(v)        u/v    Sort two unsigned data packets
CLZ        000101   tid, target, event, 0, 0                     u      Count leading zeros
CLZU       000101   tid, target, event, 0, 1                     u      Count leading zeros unsigned
AND        000110   tid, target, source, et(u)                   u
OR         000111   tid, target, source, et(u)                   u
XOR        001000   tid, target, source, et(u)                   u
EQ         001001   tid, target, source, et(v)                          Equal
CMP        001001   tid, r_pair_t, source_p, et(u), et(v)        u/v
CMPU       001010   tid, r_pair_t, source_p, et(u), et(v)        u/v
BSHL       001011   tid, r_pair_t, 0, source, 0, 0                      Barrel shift left
BSHR       001011   tid, r_pair_t, 0, source, 0, 1                      Barrel shift right
BSHRU      001011   tid, r_pair_t, 0, source, 1, 0                      Barrel shift right unsigned
MUL        001011   tid, r_pair_t, 1, source, 0, 0
MULU       001011   tid, r_pair_t, 1, source, 0, 1
DIV        001011   tid, r_pair_t, 1, source, 1, 1
Claims

1. A data processing unit having a plurality of cells, in
particular coarse-grained logic cells, interconnected
and/or interconnectable for data processing wherein at
least one cell, preferably a number of cells have
instruction storage means for storing instructions to be
executed so as that said coarse-grained cells form a
plurality of sequencers within said array.

2. A method for operating data processing in an array
comprising a plurality of logic cells, in particular coarse
grained logic cells interconnected and/or interconnectable
for data processing, wherein data are transferred
into cells from an input and/or from other cells
via busses, characterised in that at least some of the
busses are used for effecting a configuration of said
cells, in particular during runtime and/or without
effecting cells not to be configured.

3. Method according to claim 2, wherein said busses are used
with a frequency different from the frequency of data
processing in at least some of the cells.